In [ ]:
%%HTML
<style>
.container { width:100% }
</style>

Gender Estimation for First Names

This notebook gives a simple example of a naive Bayes classifier. We try to predict the gender of a first name. In order to train our classifier, we need a training set of names that are marked as being either male or female. We happen to have two text files, names-female.txt and names-male.txt, containing female and male first names. We start by defining the function read_names. This function reads a file of strings and returns a list of all the names given in the file. Care is taken that the newline character at the end of each line is discarded.


In [ ]:
def read_names(file_name):
    """Read the given file and return the list of names it contains, one name per line."""
    Result = []
    with open(file_name, 'r') as file:
        for name in file:
            Result.append(name.rstrip('\n')) # discard the trailing newline, if present
    return Result

In [ ]:
FemaleNames = read_names('names-female.txt')
MaleNames   = read_names('names-male.txt')

Let us compute the prior probabilities $P(\texttt{Female})$ and $P(\texttt{Male})$ for the classes $\texttt{Female}$ and $\texttt{Male}$. In the lecture it was shown that the prior probability of a class $C$ in a training set $T$ is given as: $$ P(C) \approx \frac{\mathtt{card}\bigl(\{t \in T \;|\; \mathtt{class}(t) = C \}\bigr)}{\mathtt{card}(T)} $$ Therefore, these probabilities are computed as follows.


In [ ]:
pFemale = len(FemaleNames) / (len(FemaleNames) + len(MaleNames))
pMale   = len(MaleNames)   / (len(FemaleNames) + len(MaleNames))
pFemale
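
Since every name in the training set is classified as either female or male, the two priors have to sum up to one. A quick sanity check, purely illustrative:


In [ ]:
pFemale + pMale  # should evaluate to 1.0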

As a first attempt to solve the problem, we will use the last character of a name as its only feature. We have to compute the conditional probability of every letter that can occur as the last letter of a name. The general formula to compute the conditional probability of a feature $f$ given a class $C$ is the following: $$ P(f\;|\;C) \approx \frac{\mathtt{card}\bigl(\{t \in T \;|\; \mathtt{class}(t) = C \wedge \mathtt{has}(t, f) \}\bigr)}{ \mathtt{card}\bigl(\{t \in T \;|\; \mathtt{class}(t) = C \}\bigr)} $$ The function conditional_prop takes a character $c$ and a gender $g$ and determines the conditional probability of seeing $c$ as the last character of a name that has the gender $g$.


In [ ]:
def conditional_prop(c, g):
    """Return the relative frequency of names of gender g that end in the character c."""
    if g == 'f':
        return len([n for n in FemaleNames if n[-1] == c]) / len(FemaleNames)
    else:
        return len([n for n in MaleNames   if n[-1] == c]) / len(MaleNames)
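
For example, we can compare the probability of seeing the letter 'a' at the end of a female name with the probability of seeing it at the end of a male name. This call is only meant to illustrate the interface of conditional_prop; the concrete numbers depend on the given name files.


In [ ]:
conditional_prop('a', 'f'), conditional_prop('a', 'm')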

Next, we define a dictionary Conditional_Probability. For every character $c$ and every gender $g \in \{\texttt{'f'}, \texttt{'m'}\}$, the entry $\texttt{Conditional\_Probability}[(c,g)]$ is the conditional probability of observing the last character $c$ if the gender is known to be $g$.


In [ ]:
Conditional_Probability = {}
for c in 'abcdefghijklmnopqrstuvwxyz':
    for g in ['f', 'm']:
        Conditional_Probability[c, g] = conditional_prop(c, g)
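
To get a feeling for how informative the last letter is, we can sort the letters by the ratio of the two conditional probabilities. This is a small exploratory sketch; the tiny constant in the denominator only serves to avoid a division by zero for letters that never occur at the end of a male name.


In [ ]:
ratios = { c: Conditional_Probability[c, 'f'] / (Conditional_Probability[c, 'm'] + 1e-9)
           for c in 'abcdefghijklmnopqrstuvwxyz'
         }
sorted(ratios.items(), key=lambda pair: pair[1], reverse=True)[:5]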

Now that we have both the prior probabilities $P(\texttt{'f'})$ and $P(\texttt{'m'})$ and also all the conditional probabilities $P(c\,|\,g)$, we are ready to implement our naive Bayes classifier. By Bayes' rule we have $P(g\,|\,c) = P(c\,|\,g) \cdot P(g) / P(c)$, and since the denominator $P(c)$ is the same for both genders, it suffices to compare the products $P(c\,|\,g) \cdot P(g)$.


In [ ]:
def classify(name):
    last   = name[-1]  # the only feature: the last character of the name
    female = Conditional_Probability[(last, 'f')] * pFemale  # proportional to P('f' | last)
    male   = Conditional_Probability[(last, 'm')] * pMale    # proportional to P('m' | last)
    if female >= male:
        return 'f'
    else:
        return 'm'

We test our classifier with two common names.


In [ ]:
classify('Christian')

In [ ]:
classify('Elena')

Let us check the overall accuracy of our classifier with respect to the training set.


In [ ]:
total   = 0
correct = 0
for n in FemaleNames:
    if classify(n) == 'f':
        correct += 1
    total += 1
for n in MaleNames:
    if classify(n) == 'm':
        correct += 1
    total += 1
accuracy = correct / total
accuracy
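
It is also instructive to compute the accuracy separately for both classes, since a classifier can achieve a decent overall accuracy while performing poorly on one of the classes. This breakdown is a small illustrative addition:


In [ ]:
accuracy_female = sum(classify(n) == 'f' for n in FemaleNames) / len(FemaleNames)
accuracy_male   = sum(classify(n) == 'm' for n in MaleNames)   / len(MaleNames)
accuracy_female, accuracy_male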

An accuracy of 76% is not too bad for a first attempt, but we can do better by using more sophisticated features.
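
One straightforward refinement is to take the last two characters of a name as the feature instead of just the last one. Since many two-letter suffixes are rare, the estimated probabilities should be smoothed. The sketch below uses Laplace smoothing with a hypothetical smoothing constant k = 1; it only indicates the general direction and does not anticipate the implementation developed below.


In [ ]:
def conditional_prop_suffix(s, g, k=1):
    """Smoothed estimate of the probability that a name of gender g ends in the suffix s."""
    Names = FemaleNames if g == 'f' else MaleNames
    count = len([n for n in Names if n[-2:] == s])
    return (count + k) / (len(Names) + k * 26 * 26)  # 26 * 26 possible two-letter suffixes

def classify_suffix(name):
    suffix = name[-2:]
    female = conditional_prop_suffix(suffix, 'f') * pFemale
    male   = conditional_prop_suffix(suffix, 'm') * pMale
    return 'f' if female >= male else 'm'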


In [ ]: